First, we load a few R packages
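The lecture does not list the packages explicitly, but a plausible set (an assumption on my part, based on the functions used later) is:

```r
# Packages used in this lecture (an assumed list; install any that are missing)
library(tidyverse)  # readr, tidyr, dplyr, ggplot2, ...
library(caret)      # train(), trainControl(), createDataPartition(), confusionMatrix()
library(GGally)     # ggpairs()
```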

Attribution: A lot of the material for this lecture came from the following resources

Motivation

In this lecture, we will be asking the question:

Can we identify a voice as male or female, based upon acoustic properties of the voice and speech?

image source

Determining a person’s gender as male or female based upon a sample of their voice initially seems to be an easy task. Often, the human ear can easily detect the difference between a male and a female voice within the first few spoken words. However, designing a computer program to do this turns out to be a bit trickier.

To accomplish that goal, we will learn about another machine learning algorithm called support vector machines (SVMs). SVMs have been around since the 1990s and originated in the computer science community. They are a form of supervised learning.

SVMs are widely applied to pattern classification and regression problems, such as:

  1. Handwritten digits classification
  2. Speech recognition (Building a model to recognize speech, accept key words and reject non-keywords)
  3. Facial expression classification
  4. Text classification

The original idea was to build a classifier for which the training data can be separated using some type of linear hyperplane. We want the hyperplane that maximizes the distance from the hyperplane to the nearest data point in either class.

image source

In the case when we cannot draw a linear hyperplane to separate the two classes of points (which is more typical), we can adapt the idea and build a non-linear classifier. The key idea is to apply a “kernel trick”. We’ll learn more about that later in the lecture.

image source

Note: We will focus on the case when there are only two classes, but there are also extensions of SVMs in the case when there are more than two classes.

How does it work?

Given a dataset with a set of features and a set of labels, we want to build a support vector machine (SVM) to predict classes for new observations.

To understand what an SVM is, let’s build up to it by considering some other types of classifiers (or hyperplanes) and how they relate to SVMs. First, let’s define what a hyperplane is.

Hyperplanes

A hyperplane is formally defined as a flat affine subspace of dimension \(p-1\). For example, in two dimensions a hyperplane is a flat one-dimensional subspace (i.e. a line). In this case, a hyperplane is defined by

\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \]

for \(X = (X_1, X_2)^{T}\) and parameters \(\beta_0\), \(\beta_1\) and \(\beta_2\). If there are points \(X = (X_1, X_2)^{T}\) that do not satisfy the above, i.e.

\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 \]

or

\[ \beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 \]

Then, we can think of the hyperplane as dividing the two-dimensional space into two halves.

In the figure below, the hyperplane \(1 + 2X_1 + 3X_2 = 0\) is shown. The set of points in the blue region is \(1 + 2X_1 + 3X_2 > 0\) and the purple region is the set of points for which \(1 + 2X_1 + 3X_2 < 0\).

image source

More formally, let’s say we have a set of \(n\) training observations \(X_i = (X_{i1}, X_{i2})^T\) with two features (\(p=2\)), and each training observation has a known label \(y_i \in \{-1,1\}\), where the observations from the blue class are labeled \(y_i = 1\) and those from the purple class are labeled \(y_i = -1\).

A hyperplane that separates the observations satisfies

\[ \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} > 0 \text{ if } y_i = 1 \]

and

\[ \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} < 0 \text{ if } y_i = -1 \]

or, equivalently,

\[ y_i (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) > 0 \text{ for all } i\in(1, \ldots, n) \]

There can be many hyperplanes that separate these points in our example. The figure on the left shows three such hyperplanes in black (out of many). If we pick one hyperplane, the figure on the right shows a grid of blue and purple points indicating the decision rule made by a classifier defined by this hyperplane.

image source

More formally, we can classify a test observation \(x^{*}\) based on the sign of

\[ f(x^{*}) = \beta_0 + \beta_1 x_1^{*} + \beta_2 x_2^{*} \]

  • If \(f(x^{*})\) is positive, then we assign \(x^{*}\) to the blue class.
  • If \(f(x^{*})\) is negative, then we assign \(x^{*}\) to the purple class.

In addition to the sign, we can also consider the magnitude of \(f(x^{*})\).

  • If \(f(x^{*})\) is far from zero, then \(x^{*}\) is far away from the hyperplane (i.e. more confidence in our class assignment).
  • If \(f(x^{*})\) is close to zero, then \(x^{*}\) is close to the hyperplane (i.e. less certain about the class assignment for \(x^{*}\)).
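To make the sign-and-magnitude rule concrete, here is a minimal base R sketch using the example hyperplane \(1 + 2X_1 + 3X_2 = 0\) from the earlier figure (the function names are my own):

```r
# Decision function for the example hyperplane 1 + 2*X1 + 3*X2 = 0
f <- function(x1, x2) 1 + 2 * x1 + 3 * x2

# Classify by the sign of f; the magnitude of f reflects confidence
classify <- function(x1, x2) ifelse(f(x1, x2) > 0, "blue", "purple")

classify(1, 1)    # "blue"   (f = 6, far from the hyperplane)
classify(-1, -1)  # "purple" (f = -4)
```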

But the problem is that this can still lead to an infinite number of possible separating hyperplanes. How can we decide which is the “best” hyperplane?

Maximal Margin Hyperplane (or Classifier)

The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations.

In the figure below, the maximal margin hyperplane is shown as a solid line. The margin is the distance from the solid line to either of the dashed lines. The two blue points and the purple point that lie on the dashed lines are the support vectors (they “support” the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well), and the distance from those points to the margin is indicated by arrows. The purple and blue grid indicates the decision rule made by a classifier based on this separating hyperplane.

image source

Note: although the maximal margin classifier is often successful, it can also lead to overfitting when \(p\) is large.

To construct a maximal margin classifier using \(n\) training observations \(x_1, \ldots, x_n\) and associated class labels \(y_1, \ldots, y_n \in \{-1, 1\}\), the maximal margin hyperplane is the solution to the optimization problem:

\[ \text{maximize}_{\beta_0, \beta_1, \ldots, \beta_p} M \] subject to \(\sum_{j=1}^p \beta_j^2 = 1\) and

\[ y_i (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}) \geq M \text{ for all } i\in(1, \ldots, n) \]

This guarantees that each observation will be on the correct side of the hyperplane and at least a distance of \(M\) from the hyperplane. Therefore, you can think of \(M\) as the margin of our hyperplane.

This works great if a separating hyperplane exists. However, many times, that isn’t true and there is no solution with \(M > 0\). So instead we can try to find a hyperplane that almost separates the classes.

Support Vector Classifier

Consider the following data that cannot be separated by a hyperplane.

image source

We could consider building a support vector classifier, or soft margin classifier, that misclassifies a few training observations in order to do a better job of classifying the remaining observations.

The margin is soft because it can be violated by some of the training observations. An observation can be not only on the wrong side of the margin, but also on the wrong side of the hyperplane.

image source

On the left there are observations that are on the correct side of the hyperplane, but the wrong side of the margin. On the right are observations that are on the wrong side of the hyperplane and the wrong side of the margin.

In fact, when there is no separating hyperplane, such a situation is inevitable. Observations on the wrong side of the hyperplane correspond to training observations that are misclassified by the support vector classifier (i.e. right figure above).

Now, the optimization problem is:

\[ \text{maximize}_{\beta_0, \beta_1, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n} M \] subject to \(\sum_{j=1}^p \beta_j^2 = 1\)

\[ y_i (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}) \geq M (1-\epsilon_i) \] for all \(i\in(1, \ldots, n)\), \(\epsilon_i \geq 0\), \(\sum_{i=1}^n \epsilon_i \leq C\), where \(C\) is a nonnegative tuning parameter (typically chosen using cross-validation). The \(\epsilon_1, \ldots, \epsilon_n\) are often called slack variables; they allow observations to be on the wrong side of the margin or hyperplane.

We won’t go into the details, but \(C\) basically controls the bias-variance trade-off.

  • When \(C\) is small, we seek narrow margins that are rarely violated; this amounts to a classifier that is highly fit to the data, which may have low bias but high variance.
  • When \(C\) is larger, the margin is wider and we allow more violations to it; this amounts to fitting the data less hard and obtaining a classifier that is potentially more biased but may have lower variance.

Interestingly, it turns out that only the observations that lie on the margin or violate it (known as the support vectors) affect the hyperplane (and hence the classification).

This makes sense. When \(C\) is large, the margin is wide and many observations violate it, so there are many support vectors (potentially more bias, but less variance). When \(C\) is small, the margin is narrow and few observations violate it, so there are very few support vectors (potentially low bias and high variance).

image source

But what if we want to consider non-linear boundaries?

Support Vector Classifier with Non-Linear boundaries

Consider the data in the left plot below. A linear support vector classifier (applied in the right plot) will perform poorly.

image source

A solution to this problem is to enlarge the feature space using functions of the predictors (e.g. quadratic, cubic, or higher-order terms) in order to address the non-linearity.

So instead of fitting a support vector classifier with \(p\) features \((X_1, X_2, \ldots, X_p)\), we could try using \(2p\) features \((X_1, X_1^2, X_2, X_2^2, \ldots, X_p, X_p^2)\).
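As a toy illustration (the function and object names are mine), expanding a feature matrix from \(p\) to \(2p\) columns by appending squared terms might look like:

```r
# Enlarge an n x p feature matrix to 2p features by appending squared terms
expand_quadratic <- function(X) {
  X2 <- X^2
  colnames(X2) <- paste0(colnames(X), "_sq")
  cbind(X, X2)
}

X <- matrix(rnorm(6), nrow = 3, dimnames = list(NULL, c("X1", "X2")))
X_big <- expand_quadratic(X)  # columns: X1, X2, X1_sq, X2_sq
dim(X_big)                    # 3 rows, 4 columns
```

A support vector classifier fit in this enlarged space has a linear boundary in \((X_1, X_2, X_1^2, X_2^2)\) but a quadratic one in the original \((X_1, X_2)\).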

Why does this lead to a non-linear boundary?

In the enlarged feature space, the decision boundary that is found is still linear. But in the original feature space, the decision boundary is of the form \(q(x) = 0\), where \(q\) is a quadratic polynomial, and its solutions are generally non-linear.

As you can imagine, there are many ways to enlarge the feature space, e.g. including higher-order polynomial terms or interaction terms such as \(X_1 X_2\). We could easily end up with a large number of features, leading to unmanageable computations.

In the next section, we will learn about the support vector machine that allows us to enlarge the feature space in an efficient way.

Support Vector Machines

The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.

The details of exactly how the support vector classifier is computed are quite technical, so I won’t go into them here. However, it’s sufficient to know that the solution to the support vector classifier problem involves only the inner products of the observations (as opposed to the observations themselves). The inner product of two observations \(x_i\) and \(x_{i^{'}}\) is given by

\[ \langle x_i, x_{i^{'}} \rangle = \sum_{j=1}^p x_{ij} x_{i^{'}j}\]

For example, the linear support vector classifier can be represented as

\[ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_{i} \rangle \]

where there are \(n\) parameters \(\alpha_i\) (one per training observation). I won’t go into the details here.

However, now suppose that instead of the inner product, we consider a generalization of the inner product of the form

\[ K( x_i, x_{i^{'}} ) \]

where \(K\) is some function called a kernel.

You can think of a kernel as a function that quantifies the similarity of two observations. For example,

\[ K( x_i, x_{i^{'}} ) = \sum_{j=1}^p x_{ij} x_{i^{'}j} \]

is a linear kernel (linear in the features) and would return the support vector classifier. In contrast,

\[ K( x_i, x_{i^{'}} ) = \Big(1 + \sum_{j=1}^p x_{ij} x_{i^{'}j} \Big)^d \]

is called a polynomial kernel of degree \(d\). If \(d > 1\), the support vector classifier results in a more flexible boundary.

image source

When the support vector classifier is combined with non-linear kernels (such as above), the resulting classifier is known as a support vector machine.

Another popular kernel is the radial kernel:

\[ K( x_i, x_{i^{'}} ) = \exp \Big(-\gamma \sum_{j=1}^p (x_{ij} - x_{i^{'}j})^2 \Big) \]
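The three kernels discussed above can be written directly as base R functions of two observations (these helper names are mine):

```r
# Kernels as similarity functions of two numeric feature vectors x and z
linear_kernel <- function(x, z) sum(x * z)
poly_kernel   <- function(x, z, d = 2) (1 + sum(x * z))^d
radial_kernel <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

x <- c(1, 2); z <- c(3, 0)
linear_kernel(x, z)       # 3
poly_kernel(x, z, d = 2)  # (1 + 3)^2 = 16
radial_kernel(x, x)       # identical observations -> maximal similarity, 1
```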

image source

Advantages

  1. SVMs are effective when the number of features is quite large.
  2. They work effectively even if the number of features is greater than the number of samples.
  3. Non-linear data can also be classified using customized hyperplanes built with the kernel trick.
  4. They are robust models for prediction problems since they maximize the margin.

Disadvantages

  1. The biggest limitation of SVMs is the choice of the kernel. The wrong choice of kernel can lead to an increase in error rate.
  2. With a large number of samples, training can be slow and performance can suffer.
  3. SVMs have good generalization performance, but they can be extremely slow in the test phase.
  4. SVMs have high algorithmic complexity and extensive memory requirements due to the use of quadratic programming.

Let’s try out these concepts on the data from our original question:

Can we identify a voice as male or female, based upon acoustic properties of the voice and speech?

Data

The data we will use is from Kaggle and is available as a .csv file.

A description of the data from Kaggle:

“This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages.”

We can actually dig a bit deeper and go to the website where the data originally came from to learn more about how the dataset was created:

“Each voice sample is stored as a .WAV file, which is then pre-processed for acoustic analysis using the specan function from the WarbleR R package. Specan measures 22 acoustic parameters on acoustic signals for which the start and end times are provided.”

“The output from the pre-processed WAV files were saved into a CSV file, containing 3168 rows and 21 columns (20 columns for each feature and one label column for the classification of male or female).”

The following acoustic properties of each voice are measured (described on Kaggle’s website):

Variable Description
meanfreq mean frequency (in kHz)
sd standard deviation of frequency
median median frequency (in kHz)
Q25 first quartile (in kHz)
Q75 third quartile (in kHz)
IQR interquartile range (in kHz)
skew skewness
kurt kurtosis
sp.ent spectral entropy
sfm spectral flatness
mode mode frequency
centroid frequency centroid
peakf peak frequency (frequency with highest energy)
meanfun average of fundamental frequency measured across acoustic signal
minfun minimum fundamental frequency measured across acoustic signal
maxfun maximum fundamental frequency measured across acoustic signal
meandom average of dominant frequency measured across acoustic signal
mindom minimum of dominant frequency measured across acoustic signal
maxdom maximum of dominant frequency measured across acoustic signal
dfrange range of dominant frequency measured across acoustic signal
modindx modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
label male or female

Data import

Let’s read in the voice.csv file into R using the read_csv() function in the readr R package.
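Assuming the file has been downloaded from Kaggle into the working directory (the object name voice is my choice), the import step looks like:

```r
library(readr)

# Read the Kaggle voice data; read_csv() prints the column specification below
voice <- read_csv("voice.csv")
voice
```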

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   label = col_character()
## )
## See spec(...) for full column specifications.
## # A tibble: 3,168 x 21
##    meanfreq     sd median     Q25    Q75    IQR  skew   kurt sp.ent   sfm
##       <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>
##  1   0.0598 0.0642 0.0320 0.0151  0.0902 0.0751 12.9  2.74e2  0.893 0.492
##  2   0.0660 0.0673 0.0402 0.0194  0.0927 0.0733 22.4  6.35e2  0.892 0.514
##  3   0.0773 0.0838 0.0367 0.00870 0.132  0.123  30.8  1.02e3  0.846 0.479
##  4   0.151  0.0721 0.158  0.0966  0.208  0.111   1.23 4.18e0  0.963 0.727
##  5   0.135  0.0791 0.125  0.0787  0.206  0.127   1.10 4.33e0  0.972 0.784
##  6   0.133  0.0796 0.119  0.0680  0.210  0.142   1.93 8.31e0  0.963 0.738
##  7   0.151  0.0745 0.160  0.0929  0.206  0.113   1.53 5.99e0  0.968 0.763
##  8   0.161  0.0768 0.144  0.111   0.232  0.121   1.40 4.77e0  0.959 0.720
##  9   0.142  0.0780 0.139  0.0882  0.209  0.120   1.10 4.07e0  0.971 0.771
## 10   0.134  0.0804 0.121  0.0756  0.202  0.126   1.19 4.79e0  0.975 0.805
## # ... with 3,158 more rows, and 11 more variables: mode <dbl>,
## #   centroid <dbl>, meanfun <dbl>, minfun <dbl>, maxfun <dbl>,
## #   meandom <dbl>, mindom <dbl>, maxdom <dbl>, dfrange <dbl>,
## #   modindx <dbl>, label <chr>

Next, let’s get an overall summary of the range of values in the dataset.

##     meanfreq             sd              median             Q25           
##  Min.   :0.03936   Min.   :0.01836   Min.   :0.01097   Min.   :0.0002288  
##  1st Qu.:0.16366   1st Qu.:0.04195   1st Qu.:0.16959   1st Qu.:0.1110865  
##  Median :0.18484   Median :0.05916   Median :0.19003   Median :0.1402864  
##  Mean   :0.18091   Mean   :0.05713   Mean   :0.18562   Mean   :0.1404556  
##  3rd Qu.:0.19915   3rd Qu.:0.06702   3rd Qu.:0.21062   3rd Qu.:0.1759388  
##  Max.   :0.25112   Max.   :0.11527   Max.   :0.26122   Max.   :0.2473469  
##       Q75               IQR               skew              kurt         
##  Min.   :0.04295   Min.   :0.01456   Min.   : 0.1417   Min.   :   2.068  
##  1st Qu.:0.20875   1st Qu.:0.04256   1st Qu.: 1.6496   1st Qu.:   5.670  
##  Median :0.22568   Median :0.09428   Median : 2.1971   Median :   8.319  
##  Mean   :0.22476   Mean   :0.08431   Mean   : 3.1402   Mean   :  36.569  
##  3rd Qu.:0.24366   3rd Qu.:0.11418   3rd Qu.: 2.9317   3rd Qu.:  13.649  
##  Max.   :0.27347   Max.   :0.25223   Max.   :34.7255   Max.   :1309.613  
##      sp.ent            sfm               mode           centroid      
##  Min.   :0.7387   Min.   :0.03688   Min.   :0.0000   Min.   :0.03936  
##  1st Qu.:0.8618   1st Qu.:0.25804   1st Qu.:0.1180   1st Qu.:0.16366  
##  Median :0.9018   Median :0.39634   Median :0.1866   Median :0.18484  
##  Mean   :0.8951   Mean   :0.40822   Mean   :0.1653   Mean   :0.18091  
##  3rd Qu.:0.9287   3rd Qu.:0.53368   3rd Qu.:0.2211   3rd Qu.:0.19915  
##  Max.   :0.9820   Max.   :0.84294   Max.   :0.2800   Max.   :0.25112  
##     meanfun            minfun             maxfun          meandom        
##  Min.   :0.05557   Min.   :0.009775   Min.   :0.1031   Min.   :0.007812  
##  1st Qu.:0.11700   1st Qu.:0.018223   1st Qu.:0.2540   1st Qu.:0.419828  
##  Median :0.14052   Median :0.046110   Median :0.2712   Median :0.765795  
##  Mean   :0.14281   Mean   :0.036802   Mean   :0.2588   Mean   :0.829211  
##  3rd Qu.:0.16958   3rd Qu.:0.047904   3rd Qu.:0.2775   3rd Qu.:1.177166  
##  Max.   :0.23764   Max.   :0.204082   Max.   :0.2791   Max.   :2.957682  
##      mindom             maxdom             dfrange          modindx       
##  Min.   :0.004883   Min.   : 0.007812   Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:0.007812   1st Qu.: 2.070312   1st Qu.: 2.045   1st Qu.:0.09977  
##  Median :0.023438   Median : 4.992188   Median : 4.945   Median :0.13936  
##  Mean   :0.052647   Mean   : 5.047277   Mean   : 4.995   Mean   :0.17375  
##  3rd Qu.:0.070312   3rd Qu.: 7.007812   3rd Qu.: 6.992   3rd Qu.:0.20918  
##  Max.   :0.458984   Max.   :21.867188   Max.   :21.844   Max.   :0.93237  
##     label          
##  Length:3168       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

A quick glimpse over the data shows us that we have 20 numeric columns with differing ranges and magnitudes.

Data wrangling

It would be nice to get a picture of how these features are different across the male and female observations. One way to do that is to use ggplot() to explore differences in distribution with boxplots and histograms.

First, let’s transform the data from a wide format to a long format using the gather() function in the tidyr package.
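A sketch of that reshaping step (the object name voice_long is an assumption):

```r
library(tidyr)
library(dplyr)

# Wide -> long: one row per (label, feature, value) combination
voice_long <- voice %>%
  gather(key = "feature", value = "value", -label)

head(voice_long)
```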

## # A tibble: 6 x 3
##   label feature   value
##   <chr> <chr>     <dbl>
## 1 male  meanfreq 0.0598
## 2 male  meanfreq 0.0660
## 3 male  meanfreq 0.0773
## 4 male  meanfreq 0.151 
## 5 male  meanfreq 0.135 
## 6 male  meanfreq 0.133

We can also transform the label column, which contains the character strings male and female, into 1s and 0s, where 1 represents male and 0 represents female.
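One way this recoding might look (object names are assumptions):

```r
library(dplyr)

# Keep the original strings for a sanity check, then recode the outcome:
# 1 = male, 0 = female (as a factor, which caret expects for classification)
voice_labels <- voice$label
voice <- voice %>%
  mutate(label = factor(ifelse(label == "male", 1, 0)))

table(voice_labels)               # counts of female and male samples
table(voice_labels, voice$label)  # sanity check: female -> 0, male -> 1
```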

## 
## female   male 
##   1584   1584

Just as a sanity check:

##             
## voice_labels    0    1
##       female 1584    0
##       male      0 1584

Whew ok good!

Exploratory data analyses

If we wanted to create boxplots of all twenty variables, colored by whether the observation was male or female, we can use ggplot() with geom_boxplot() and facet_wrap().
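A sketch of that plot, assuming the long-format data from the gather() step is named voice_long:

```r
library(ggplot2)

# One boxplot panel per feature; free y scales since magnitudes differ widely
ggplot(voice_long, aes(x = label, y = value, fill = label)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")
```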

These are great for looking at the distributions separately, but it would also be good to get an idea of how the features are related to each other.

To do that, another useful plotting function for exploratory data analysis is the ggpairs() function from the GGally package:
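For example (the subset of features shown is my choice; all twenty at once is hard to read):

```r
library(dplyr)
library(GGally)

# Pairwise scatterplots, correlations, and densities for a few features
voice %>%
  select(meanfreq, sd, IQR, meanfun, label) %>%
  ggpairs(aes(color = label, alpha = 0.5))
```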

Classification models

Next, we will build a few models to classify the recorded voice samples as male or female using the available features. First, we will look at SVMs, and then we will compare them to other models useful for classification that we have already seen, including logistic regression and random forests.

Support Vector Machines

Before we build an SVM classifier, let’s split our data into a train_set and test_set using the createDataPartition() function in the caret R package.

We’ll just split it in half for the purposes of the lecture.

We can look at the dimensions of the two datasets to make sure they have been split in half.

## [1] 1584   21
## [1] 1584   21

And they have! Ok, before we build an SVM using the train() function (we’ve seen this before), let’s use the trainControl() function. Here, we select method=cv with 10-fold cross-validation.
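The splitting and cross-validation setup above can be sketched as follows (the seed value and object names are assumptions):

```r
library(caret)

set.seed(1234)  # for a reproducible split (seed value is arbitrary)

# Stratified 50/50 split on the label
inTrain   <- createDataPartition(voice$label, p = 0.5, list = FALSE)
train_set <- voice[inTrain, ]
test_set  <- voice[-inTrain, ]

dim(train_set)
dim(test_set)

# 10-fold cross-validation, reused by all the train() calls below
fitControl <- trainControl(method = "cv", number = 10)
```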

SVM with linear kernel

First, we will use the train() function from the caret R package with the argument method=svmLinear to build an SVM with a linear kernel.
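A call along these lines produces the summary below (assuming the training data and cross-validation control object are named train_set and fitControl):

```r
# Fit a linear-kernel SVM with 10-fold cross-validation
fit_svmLinear <- train(label ~ ., data = train_set,
                       method = "svmLinear",
                       trControl = fitControl)
fit_svmLinear
```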

## Support Vector Machines with Linear Kernel 
## 
## 1584 samples
##   20 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1426, 1426, 1425, 1425, 1426, 1426, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9760129  0.9520249
## 
## Tuning parameter 'C' was held constant at a value of 1

Now that the SVM has been built on our train_set, we can classify the recorded voice samples in our test_set using the predict() function.

We can also look at the confusion matrix and statistics.
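For example (the object names fit_svmLinear and test_set are assumptions):

```r
# Predict labels for the held-out samples and tabulate against the truth
pred <- predict(fit_svmLinear, newdata = test_set)
confusionMatrix(table(pred = pred, truth = test_set$label))
```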

## Confusion Matrix and Statistics
## 
##     truth
## pred   0   1
##    0 784  35
##    1   8 757
##                                           
##                Accuracy : 0.9729          
##                  95% CI : (0.9636, 0.9803)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9457          
##  Mcnemar's Test P-Value : 7.341e-05       
##                                           
##             Sensitivity : 0.9899          
##             Specificity : 0.9558          
##          Pos Pred Value : 0.9573          
##          Neg Pred Value : 0.9895          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4949          
##    Detection Prevalence : 0.5170          
##       Balanced Accuracy : 0.9729          
##                                           
##        'Positive' Class : 0               
## 

SVM with polynomial kernel

Next, we will use the train() function from the caret R package with the argument method=svmPoly to build an SVM with a polynomial kernel.

## Support Vector Machines with Polynomial Kernel 
## 
## 1584 samples
##   20 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1426, 1425, 1426, 1426, 1425, 1425, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa    
##   1       0.001  0.25  0.8857615  0.7715693
##   1       0.001  0.50  0.8908168  0.7816900
##   1       0.001  1.00  0.9040761  0.8082178
##   1       0.010  0.25  0.9520301  0.9040708
##   1       0.010  0.50  0.9677971  0.9356012
##   1       0.010  1.00  0.9696879  0.9393813
##   1       0.100  0.25  0.9728525  0.9457090
##   1       0.100  0.50  0.9734814  0.9469666
##   1       0.100  1.00  0.9734814  0.9469666
##   2       0.001  0.25  0.8901839  0.7804242
##   2       0.001  0.50  0.9047050  0.8094788
##   2       0.001  1.00  0.9425683  0.8851509
##   2       0.010  0.25  0.9690471  0.9380974
##   2       0.010  0.50  0.9703129  0.9406277
##   2       0.010  1.00  0.9734734  0.9469478
##   2       0.100  0.25  0.9823024  0.9646040
##   2       0.100  0.50  0.9791577  0.9583150
##   2       0.100  1.00  0.9778959  0.9557917
##   3       0.001  0.25  0.8996656  0.7993887
##   3       0.001  0.50  0.9198511  0.8397474
##   3       0.001  1.00  0.9640076  0.9280209
##   3       0.010  0.25  0.9728405  0.9456820
##   3       0.010  0.50  0.9766261  0.9532521
##   3       0.010  1.00  0.9760051  0.9520103
##   3       0.100  0.25  0.9810485  0.9620964
##   3       0.100  0.50  0.9804156  0.9608310
##   3       0.100  1.00  0.9785208  0.9570411
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 2, scale = 0.1 and C
##  = 0.25.

Now that the SVM has been built on our train_set, we can classify the recorded voice samples in our test_set using the predict() function.

We can also look at the confusion matrix and statistics.

## Confusion Matrix and Statistics
## 
##     truth
## pred   0   1
##    0 782  32
##    1  10 760
##                                           
##                Accuracy : 0.9735          
##                  95% CI : (0.9643, 0.9808)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.947           
##  Mcnemar's Test P-Value : 0.001194        
##                                           
##             Sensitivity : 0.9874          
##             Specificity : 0.9596          
##          Pos Pred Value : 0.9607          
##          Neg Pred Value : 0.9870          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4937          
##    Detection Prevalence : 0.5139          
##       Balanced Accuracy : 0.9735          
##                                           
##        'Positive' Class : 0               
## 

SVM with radial basis kernel

Next, we will use the train() function from the caret R package with the argument method=svmRadial to build an SVM with a radial basis kernel.

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1584 samples
##   20 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1426, 1426, 1425, 1425, 1426, 1424, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.9734931  0.9469865
##   0.50  0.9798063  0.9596128
##   1.00  0.9829669  0.9659335
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05273699
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05273699 and C = 1.

Now that the SVM has been built on our train_set, we can classify the recorded voice samples in our test_set using the predict() function.

We can also look at the confusion matrix and statistics.

## Confusion Matrix and Statistics
## 
##     truth
## pred   0   1
##    0 784  25
##    1   8 767
##                                           
##                Accuracy : 0.9792          
##                  95% CI : (0.9709, 0.9856)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9583          
##  Mcnemar's Test P-Value : 0.005349        
##                                           
##             Sensitivity : 0.9899          
##             Specificity : 0.9684          
##          Pos Pred Value : 0.9691          
##          Neg Pred Value : 0.9897          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4949          
##    Detection Prevalence : 0.5107          
##       Balanced Accuracy : 0.9792          
##                                           
##        'Positive' Class : 0               
## 

Logistic regression

Now let’s compare to some other classification approaches that we have learned about.

First, let’s try logistic regression.
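One way to fit it with the same cross-validation setup (object names are assumptions):

```r
# Logistic regression via caret; glm with a binomial family
fit_glm <- train(label ~ ., data = train_set,
                 method = "glm", family = "binomial",
                 trControl = fitControl)
summary(fit_glm$finalModel)

pred_glm <- predict(fit_glm, newdata = test_set)
confusionMatrix(table(pred = pred_glm, truth = test_set$label))
```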

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8194  -0.0169   0.0001   0.0794   4.7018  
## 
## Coefficients: (3 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.641e+00  1.575e+01  -0.104  0.91702    
## meanfreq     3.990e+01  7.280e+01   0.548  0.58358    
## sd          -3.694e+01  5.572e+01  -0.663  0.50735    
## median      -1.995e+01  2.096e+01  -0.952  0.34110    
## Q25         -8.262e+01  1.889e+01  -4.375 1.22e-05 ***
## Q75          8.243e+01  3.143e+01   2.622  0.00873 ** 
## IQR                 NA         NA      NA       NA    
## skew        -1.545e-01  3.251e-01  -0.475  0.63463    
## kurt         7.587e-04  8.956e-03   0.085  0.93249    
## sp.ent       3.177e+01  1.669e+01   1.903  0.05703 .  
## sfm         -6.693e+00  3.857e+00  -1.735  0.08267 .  
## mode         2.249e+00  3.446e+00   0.653  0.51405    
## centroid            NA         NA      NA       NA    
## meanfun     -1.998e+02  1.666e+01 -11.992  < 2e-16 ***
## minfun       6.018e+01  1.222e+01   4.924 8.47e-07 ***
## maxfun      -2.269e+01  1.183e+01  -1.919  0.05504 .  
## meandom     -3.142e-02  6.783e-01  -0.046  0.96306    
## mindom      -2.382e+00  3.212e+00  -0.742  0.45834    
## maxdom      -6.125e-02  9.788e-02  -0.626  0.53146    
## dfrange             NA         NA      NA       NA    
## modindx     -6.787e+00  2.490e+00  -2.726  0.00642 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2195.89  on 1583  degrees of freedom
## Residual deviance:  222.32  on 1566  degrees of freedom
## AIC: 258.32
## 
## Number of Fisher Scoring iterations: 9
## Confusion Matrix and Statistics
## 
##     truth
## pred   0   1
##    0 783  35
##    1   9 757
##                                           
##                Accuracy : 0.9722          
##                  95% CI : (0.9629, 0.9797)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9444          
##  Mcnemar's Test P-Value : 0.000164        
##                                           
##             Sensitivity : 0.9886          
##             Specificity : 0.9558          
##          Pos Pred Value : 0.9572          
##          Neg Pred Value : 0.9883          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4943          
##    Detection Prevalence : 0.5164          
##       Balanced Accuracy : 0.9722          
##                                           
##        'Positive' Class : 0               
## 

That’s actually not so bad.

Random Forests

Next let’s try random forests.

## Confusion Matrix and Statistics
## 
##     truth
## pred   0   1
##    0 781  25
##    1  11 767
##                                          
##                Accuracy : 0.9773         
##                  95% CI : (0.9687, 0.984)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.9545         
##  Mcnemar's Test P-Value : 0.03026        
##                                          
##             Sensitivity : 0.9861         
##             Specificity : 0.9684         
##          Pos Pred Value : 0.9690         
##          Neg Pred Value : 0.9859         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4931         
##    Detection Prevalence : 0.5088         
##       Balanced Accuracy : 0.9773         
##                                          
##        'Positive' Class : 0              
## 

I’m forgetting how the performance of the random forest model compares to the others. So let’s take a closer look.
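The comparison below uses caret's resamples() to collect the cross-validated results from all five fits (the fit_glm and fit_rf object names are assumptions; the list names match the output):

```r
# Gather cross-validation results from every fitted model into one object
class_results <- resamples(list(glm           = fit_glm,
                                rf            = fit_rf,
                                fit_svmLinear = fit_svmLinear,
                                fit_svmPoly   = fit_svmPoly,
                                fit_svmRadial = fit_svmRadial))
summary(class_results)
```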

## 
## Call:
## summary.resamples(object = class_results)
## 
## Models: glm, rf, fit_svmLinear, fit_svmPoly, fit_svmRadial 
## Number of resamples: 10 
## 
## Accuracy 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## glm           0.9556962 0.9684042 0.9747632 0.9747353 0.9857595 0.9874214
## rf            0.9556962 0.9700860 0.9810127 0.9779118 0.9858491 0.9936709
## fit_svmLinear 0.9620253 0.9620850 0.9747632 0.9760129 0.9858188 1.0000000
## fit_svmPoly   0.9556962 0.9762658 0.9873418 0.9823024 0.9874214 1.0000000
## fit_svmRadial 0.9683544 0.9748821 0.9841772 0.9829669 0.9921085 1.0000000
##               NA's
## glm              0
## rf               0
## fit_svmLinear    0
## fit_svmPoly      0
## fit_svmRadial    0
## 
## Kappa 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## glm           0.9113924 0.9368040 0.9495253 0.9494684 0.9715190 0.9748458
## rf            0.9113924 0.9401738 0.9620253 0.9558242 0.9716962 0.9873418
## fit_svmLinear 0.9240506 0.9241663 0.9495293 0.9520249 0.9716377 1.0000000
## fit_svmPoly   0.9113924 0.9525316 0.9746835 0.9646040 0.9748408 1.0000000
## fit_svmRadial 0.9367089 0.9497627 0.9683544 0.9659335 0.9842168 1.0000000
##               NA's
## glm              0
## rf               0
## fit_svmLinear    0
## fit_svmPoly      0
## fit_svmRadial    0

So it looks like the SVM does give us a bit of a performance boost over logistic regression and random forests.

What about bagging or boosting? I will leave it for you to try!